TF perturbation Analysis with CellOracle¶
Author: Alvaro Regano (alvaro.regano@gmail.com)¶
Overview¶
This notebook describes how to run in silico TF perturbations with the GRN models produce in the notebook: Network_analysis_scAGM_EHT.ipynb The notebook uses CellOracle for TF perturbation analysis (Kamimoto, K., Stringa, B., Hoffmann, C. M., Jindal, K., Solnica-Krezel, L., & Morris, S. A.(2023). Dissecting cell identity via network inference and in silico gene perturbation.Nature, 614(7949), 742-751. https://doi.org/10.1038/s41586-022-05688-9). It has been adapted to the project of endothelial to hematopoietic transition with E10.5 mouse embryos.
Notebook based on this notebook produced by Kamimoto. K:
Data¶
In this notebook, CellOracle uses two types of input data.
- Input data 1: Oracle object.
I will use the celloracle object produced from Endothelial cells from E10.5 aorta gonad mesonephros of mouse embryos, which was processed in the Network_analysis_scAGM_EHT.ipynb Notebook.
- Input data 2: Links object.
I will use the links object produced from Endothelial cells from E10.5 aorta gonad mesonephros of mouse embryos and the scATAC data from GSE137115 accession number from Zhu et al. (2020), which was processed in the Network_analysis_scAGM_EHT.ipynb Notebook.
What you can do¶
In this notebook, I perform two analyses.
in silico TF perturbation to simulate cell identity shifts. CellOracle uses the GRN model to simulate cell identity shifts in response to TF perturbations. For this analysis, you will need the GRN models from the previous notebook. In this analysis, I used KNN imputation before constructing the celloracle object for calculating vector shifts in response to TF perturbations.
Compare simulation vectors with developmental vectors. In order to properly interpret the simulation results, it is also important to consider the natural direction of development. First, we will calculate a pseudotime gradient vector field using a pseudotime trayectory previously omputed with Palantir to recapitulate the developmental flow. Then, we will compare the CellOracle TF perturbation vector field with the developmental vector field by calculating the inner product scores. Let's call the inner product value as perturbation score (PS). Please see the step 5.6 for detail.
Custom data class / object¶
In this notebook, CellOracle uses four custom classes, Oracle, Links, Gradient_calculator, and Oracle_development_module.
Oracleis the main class in the CellOracle package. It is responsible for most of the calculations during GRN model construction and TF perturbation simulations.Linksis a class to store GRN data.The
Gradient_calculatorcalculates the developmental vector field from the pseudotime results. Here pseudotime trayectory was calculated using Palantir.The
Oracle_development_moduleintegrates theOracleobject data and theGradient_calculatorobject data to analyze how TF perturbation affects on the developmental process. It also has many visualization functions.
import os
import sys
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
This notebook was made with celloracle version 0.10.8. Please use celloracle>=0.10.8 Otherwise, you may get an error.
import celloracle as co
co.__version__
'0.18.0'
#plt.rcParams["font.family"] = "arial"
plt.rcParams["figure.figsize"] = [6,6]
%config InlineBackend.figure_format = 'retina'
plt.rcParams["savefig.dpi"] = 600
%matplotlib inline
0.1. Make a folder to save graph¶
# Make folder to save plots
save_folder = "../../Plots/Paper/CellOracle/New/"
os.makedirs(save_folder, exist_ok=True)
oracle = co.load_hdf5("../../celloracle/scAGM_EHT_Zhu_byCondition_from_Aorta_ECs.celloracle.oracle")
oracle
Oracle object
Meta data
celloracle version used for instantiation: 0.17.1
n_cells: 1387
n_genes: 2007
cluster_name: Condition
dimensional_reduction_name: X_umap
n_target_genes_in_TFdict: 16077 genes
n_regulatory_in_TFdict: 1094 genes
n_regulatory_in_both_TFdict_and_scRNA-seq: 142 genes
n_target_genes_both_TFdict_and_scRNA-seq: 1583 genes
k_for_knn_imputation: 34
Status
Gene expression matrix: Ready
BaseGRN: Ready
PCA calculation: Done
Knn imputation: Done
GRN calculation for simulation: Not finished
1.2. Load inferred GRNs¶
In the previous notebook, we calculated the GRNs. Now, we will use these GRNs for the perturbation simulations. First, we will import the GRNs from the Links object.
links = co.load_hdf5("../../celloracle/scAGM_EHT_Zhu_byCondition_from_Aorta_ECs.links.celloracle.links")
2. Make predictive models for simulation¶
Here, we will need to fit the ridge regression models again. This process will take less time than the GRN inference in the previous notebook, because we are using the filtered GRN models.
links.filter_links()
oracle.get_cluster_specific_TFdict_from_Links(links_object=links)
oracle.fit_GRN_for_simulation(alpha=10,
use_cluster_specific_TFdict=True)
0%| | 0/7 [00:00<?, ?it/s]
3. In silico TF perturbation analysis¶
Next, we will simulate the TF perturbation effects on cell identity to investigate its potential functions and regulatory mechanisms. Please see the CellOracle paper for more details on scientific rationale.
In this notebook, we will simulate the knockout of the Gata1 gene in the hematopoiesis trajectory.
Previous studies have shown that Gata1 is one of the TFs that regulates cell fate decisions in myeloid progenitors. Additionally, Gata1 has been shown to affect erythroid cell differentiation.
Here, we will use CellOracle to analyze Gata1 and attempt to recapitulate the previous findings from above.
3.1. Check gene expression pattern.¶
# Check gene expression
goi = "Hey1"
# Try with the following TFs : "Gata2" "Runx1" "Hey1" "Hey2" "Hes1" "Gfi1" "Mycn" "Myc"
# Plot gene expression in histogram
# Imputed counts
sc.get.obs_df(oracle.adata, keys=[goi], layer="imputed_count").hist()
# Normalized counts
sc.get.obs_df(oracle.adata, keys=[goi], layer="raw_count").hist()
plt.show()
3.2. Calculate future gene expression after perturbation.¶
You can use any gene expression value in the in silico perturbations, but please avoid extremely high values that are far from the natural gene expression range. The upper limit allowed is twice the maximum gene expression.
In this analysis I looked into KO perturbations, simulating GOI to 0.
To simulate Hey1 KO, we will set Hey1 expression to 0.
# Enter perturbation conditions to simulate signal propagation after the perturbation.
# Gata2 = 0.5
# Runx1 = 1.75
# Hey1 = 1.0
# Hey2 = 0.25
# Hes1 = 0.75
# Gfi1 = 0.25
# Mycn = 0.5
# Try with the following TFs : "Gata2" "Runx1" "Hey1" "Hey2" "Hes1" "Gfi1" "Mycn" "Myc"
goi = "Hey1"
oracle.simulate_shift(perturb_condition={goi: 0.0},
n_propagation=3)
After simulation, celloracle automatically evaluates whether the range of simulated values is appropriate and returns a warning if there is an atypical distribution. You can also evaluate the distribution pattern of simulated values in detail yourself.
oracle.evaluate_and_plot_simulation_value_distribution(n_genes=4,
save=None)
3.3. Calculate transition probability between cells¶
The steps above simulated the global future gene expression shift after perturbation. This prediction is based on iterative calculations of signal propagation calculations within the GRN.
The next step calculates the probability of cell state transitions based on the simulation data. You can use the transition probabilities between cells to predict how cells will change after a perturbation.
This transition probability will be used later.
# Get transition probability
# From Aorta ECs: n_neighbors = 34
# With Arterial ECs: 44
oracle.estimate_transition_prob(n_neighbors=34,
knn_random=True,
sampled_fraction=1)
# Calculate embedding
oracle.calculate_embedding_shift(sigma_corr=0.05)
Caution: It is very important to find the optimal scale parameter.¶
We will need to adjust the
scaleparameter. Please seek to find the optimalscaleparameter for the data based on your data.If the vectors are not visible, you can try a smaller
scaleparameter to magnify the vector length. However, if you see large vectors in the randomized results (right panel), it means that the scale parameter is too small.
fig, ax = plt.subplots(1, 2, figsize=[13, 6])
# With Imputation
# scale = 40
scale = 25
# Show quiver plot
oracle.plot_quiver(scale=scale, ax=ax[0])
ax[0].set_title(f"Simulated cell identity shift vector: {goi} KO")
# Show quiver plot that was calculated with randomized graph.
oracle.plot_quiver_random(scale=scale, ax=ax[1])
ax[1].set_title(f"Randomized simulation vector")
plt.show()
4.2. Vector field graph¶
We will visualize the simulation results as a vector field on a digitized grid. Single cell transition vectors are grouped by grid point.
4.2.1 Find parameters for n_grid and min_mass¶
n_grid: Number of grid points.
min_mass: Threshold value for the cell density.
The appropriate values for these parameters depends on the data. Please find appropriate values using the helper functions below.
# n_grid = 40 is a good starting value.
n_grid = 40
oracle.calculate_p_mass(smooth=0.8, n_grid=n_grid, n_neighbors=44)
Please run oracle.suggest_mass_thresholds() to display a range of min_mass parameter values and choose a value to fit the data.
# Search for best min_mass.
oracle.suggest_mass_thresholds(n_suggestion=12)
According to the results, the optimal min_mass is around 0.011.
min_mass = 8
oracle.calculate_mass_filter(min_mass=min_mass, plot=True)
4.2.2 Plot vector fields¶
Again, we need to adjust the
scaleparameter. Please seek to find the optimalscaleparameter that provides good visualization.If you don't see any vector, you can try the smaller
scaleparameter to magnify the vector length. However, if you see large vectors in the randomized results (right panel), it means the scale parameter is too small.
fig, ax = plt.subplots(1, 2, figsize=[13, 6])
# With imputation
# scale_simulation = 40
# From Aorta ECs
scale_simulation = 12
# Without imputation NEED tTO SET UP EACH TF MANUALLY
# scale_simulation = 6
# Show quiver plot
oracle.plot_simulation_flow_on_grid(scale=scale_simulation, ax=ax[0])
ax[0].set_title(f"Simulated cell identity shift vector: {goi} KO")
# Show quiver plot that was calculated with randomized graph.
oracle.plot_simulation_flow_random_on_grid(scale=scale_simulation, ax=ax[1])
ax[1].set_title(f"Randomized simulation vector")
plt.show()
# Plot vector field with cell cluster
fig, ax = plt.subplots(figsize=[8, 8])
oracle.plot_cluster_whole(ax=ax, s=10)
oracle.plot_simulation_flow_on_grid(scale=scale_simulation, ax=ax, show_background=False)
5. [This step is optional] Compare simulation vector with development vectors¶
5.0. Method overview¶
As shown above, we can simulate how TF perturbations affect cell identity and visualize the results as a vector field map.
To interpret the results, it is necessary to take into account the direction of natural differentiation. We will compare the simulated perturbation vectors with the developmental gradient vectors. By comparing them, we can intuitively understand how TFs impact cell fate determination during development. This perspective is also important for estimating the experimental perturbation results.
Here, we will calculate the development vector field using pseudotime gradient as follows. This gradient was produced previously with Palantir.
Transfer the pseudotime data into an n x n digitized grid.
Calculate the 2D gradient of pseudotime to get vector field.
Compare the in silico TF perturbation vector field with the development vector field by calculating inner product between these two vectors.
- Note: Other methods may be used to represent a continuous scRNA-seq trajectory flow. For example, RNA velocity analysis is a good way to estimate the direction of cell differentiation. Choose the method that best suits your data.
5.1 Prepare pseudotime data¶
5.2. Check data¶
# Visualize pseudotime
fig, ax = plt.subplots(figsize=[6,6])
sc.pl.embedding(adata=oracle.adata, basis=oracle.embedding_name, ax=ax, cmap="plasma",
color=["palantir_pseudotime"], save="Pseudotime_Palantir_scAGM_EHT.png")
WARNING: saving figure to file figures/X_umapPseudotime_Palantir_scAGM_EHT.png
5.3. Make Gradient_calculator object¶
from celloracle.applications import Gradient_calculator
# Instantiate Gradient calculator object
gradient = Gradient_calculator(oracle_object=oracle, pseudotime_key="palantir_pseudotime")
We need to set n_grid and min_mass for the pseudotime grid point calculation too.
n_grid: Number of grid point.
We already know approproate values for them. Please set the same values as step 4.2.1 above.
gradient.calculate_p_mass(smooth=0.8, n_grid=n_grid, n_neighbors=34)
gradient.calculate_mass_filter(min_mass=min_mass, plot=True)
5.4 Transfer pseudotime values to the grid points.¶
Next, we convert the pseudotime data into grid points. For this calculation we can chose one of two methods.
knn: K-Nearest Neighbor regression. You will need to set number of neighbors. Please adjustn_knnfor best results.This will depend on the population size and density of your scRNA-seq data.
gradient.transfer_data_into_grid(args={"method": "knn", "n_knn":50})
polynomial: Polynomial regression using x-axis and y-axis of dimensional reduction space.
In general, this method will be more robust. Please use this method if knn method does not work.
n_poly is the number of degrees for the polynomial regression model. Please try to find appropriaten_poly searching for best results.
gradient.transfer_data_into_grid(args={"method": "polynomial", "n_poly":3})
gradient.transfer_data_into_grid(
# args={"method": "knn", "n_knn":34},
args={"method": "polynomial", "n_poly":2},
plot=True)
5.5. Calculate Gradient vectors¶
Calculate the 2D vector map to represents the pseudotime gradient. After the gradient calculation, the length of the vector will be normalized automatically.
Please adjust the scale parameter to adjust vector length visualization.
# Calculate graddient
gradient.calculate_gradient()
# Show results
scale_dev = 12
gradient.visualize_results(scale=scale_dev, s=5)
gradient.plot_pseudotime(cmap = "plasma")
gradient.plot_reference_flow_on_grid(show_background=False)
plt.title("Pseudotime + \nDevelopment flow")
# plt.savefig(''.join([save_folder, '/Palantir_pseudotime_Dev_flow.png']), dpi = 'figure', bbox_inches='tight')
# plt.savefig(''.join([save_folder, '/Palantir_pseudotime_Dev_flow.pdf']), format = 'pdf', bbox_inches='tight')
Text(0.5, 1.0, 'Pseudotime + \nDevelopment flow')
# Visualize results
fig, ax = plt.subplots(figsize=[6, 6])
gradient.plot_dev_flow_on_grid(scale=scale_dev, ax=ax)
5.6. Calculate Inner product between two vectors¶
We will use the inner product calculations to quantitatively compare the 2D developmental flow and perturbation simulation vectors.
If you are not familiar with Inner product / Dot product, please see https://en.wikipedia.org/wiki/Dot_product
The inner product represents the similarity between two vectors.
Using the inner product, we compare the 2D vector field of perturbation simulation and development flow.
The inner product will be a positive value when two vectors are pointing in the same direction.
Conversely, the inner product will be a negative value when two vectors are pointing in the opposing directions.
- The length of the vectors also affects the magnitude of inner product value.
In summary, we quantitatively compare the directionality and size of vectors between perturbation simulation and natural differentiation using inner product, and we define the score as perturbation score (PS).
A negative PS means that the TF perturbation would block differentiation .
A positive PS means that the TF perturbation would promote differentiation .
from celloracle.applications import Oracle_development_module
# Make Oracle_development_module to compare two vector field
dev = Oracle_development_module()
# Load development flow
dev.load_differentiation_reference_data(gradient_object=gradient)
# Load simulation result
dev.load_perturb_simulation_data(oracle_object=oracle)
# Calculate inner produc scores
dev.calculate_inner_product()
dev.calculate_digitized_ip(n_bins=10)
5.7. Show results¶
Here, need to adjust the
vmparameter for the PS score color visualization. Please seek to find the optimalvmparameter that provides good visualization.If you don't see any color in the left panel below, you can try the smaller
vmparameter to magnify the scale of vm visualization. However, if you see colors in the randomized results (right panel), it means the vm parameter is too small.
# Show perturbation scores
from celloracle.visualizations.config import CONFIG
CONFIG["cmap_ps"] = "PiYG"
# With Imputation
vm = 1.0
# Without Imputation
vm = 0.4
fig, ax = plt.subplots(1, 2, figsize=[12, 6])
dev.plot_inner_product_on_grid(vm=0.5, s=50, ax=ax[0])
ax[0].set_title(f"PS")
dev.plot_inner_product_random_on_grid(vm=vm, s=50, ax=ax[1])
ax[1].set_title(f"PS calculated with Randomized simulation vector")
plt.show()
# Show perturbation scores with perturbation simulation vector field
save_folder = "../../Plots/Paper/CellOracle/New/Perturbations"
os.makedirs(save_folder, exist_ok = True)
fig, ax = plt.subplots(figsize=[6, 6])
dev.plot_inner_product_on_grid(vm=vm, s=50, ax=ax)
dev.plot_simulation_flow_on_grid(scale=scale_simulation, show_background=False, ax=ax)
plt.title(f"{goi} KO perturbation")
# plt.savefig(''.join([save_folder, f'/{goi}_KO_Perturbation_scAGM_EHT.png']), dpi = 'figure', bbox_inches='tight')
# plt.savefig(''.join([save_folder, f'/{goi}_KO_Perturbation_scAGM_EHT.pdf']), format = 'pdf', bbox_inches='tight')
Text(0.5, 1.0, 'Hey1 KO perturbation')
# Let's visualize the results
dev.visualize_development_module_layout_0(s=5,
scale_for_simulation=scale_simulation,
s_grid=50,
scale_for_pseudotime=scale_dev,
vm=vm)
2024-08-08 15:39:10,149 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting. 2024-08-08 15:39:10,152 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
6. Focus on Control condition to interpret the results in detail¶
So far, we have used the Oracle_development_module to analyze the whole cell population.
However, the Oracle_development_module can also analyze specific subpopulations.
In this example, I analyzed Control conditions
# Get cell index list for the cells of interest
clusters = ['Control']
cell_idx = np.where(oracle.adata.obs["Condition"].isin(clusters))[0]
dev = Oracle_development_module()
# Load development flow
dev.load_differentiation_reference_data(gradient_object=gradient)
# Load simulation result
dev.load_perturb_simulation_data(oracle_object=oracle,
cell_idx_use=cell_idx, # Enter cell id list
name="Control" # Name of this cell group. You can enter any name.
)
# Calculation
dev.calculate_inner_product()
dev.calculate_digitized_ip(n_bins=10)
# Let's visualize the results
dev.visualize_development_module_layout_0(s=5,
scale_for_simulation=scale_simulation,
s_grid=50,
scale_for_pseudotime=scale_dev,
vm=vm)
2024-08-08 15:39:27,892 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting. 2024-08-08 15:39:27,895 - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
Using only Control cells causes a substantial loss of cells affecting the grid and the interpretability of the data
import session_info
session_info.show()
Click to view session information
----- celloracle 0.18.0 matplotlib 3.6.3 numpy 1.24.4 pandas 1.5.3 scanpy 1.9.8 seaborn 0.13.2 session_info 1.0.0 -----
Click to view modules imported as dependencies
Bio 1.83 PIL 10.4.0 anndata 0.9.2 anyio NA appdirs 1.4.4 arrow 1.3.0 asttokens NA attr 24.2.0 attrs 24.2.0 babel 2.15.0 backcall 0.2.0 backports NA biothings_client 0.3.1 certifi 2024.07.04 cffi 1.17.0 charset_normalizer 3.3.2 colorama 0.4.6 comm 0.2.2 compat NA cycler 0.12.1 cython_runtime NA dateutil 2.9.0.post0 debugpy 1.8.5 decorator 5.1.1 defusedxml 0.7.1 diskcache 5.6.3 exceptiongroup 1.2.2 executing 2.0.1 fastjsonschema NA filelock 3.15.4 fqdn NA ftpretty NA genomepy 0.16.1 gimmemotifs 0.17.2 goatools 1.4.12 h5py 3.11.0 idna 3.7 igraph 0.11.6 importlib_metadata NA importlib_resources NA ipykernel 6.29.5 ipywidgets 8.1.3 isoduration NA iteround NA jaraco NA jedi 0.19.1 jinja2 3.1.4 joblib 1.4.2 json5 0.9.25 jsonpointer 3.0.0 jsonschema 4.23.0 jsonschema_specifications NA jupyter_events 0.10.0 jupyter_server 2.14.2 jupyterlab_server 2.27.3 kiwisolver 1.4.5 llvmlite 0.41.1 logomaker NA loguru 0.7.2 loompy 3.0.7 louvain 0.8.2 markupsafe 2.1.5 matplotlib_inline 0.1.7 more_itertools 10.3.0 mpl_toolkits NA mygene 3.2.2 mysql NA natsort 8.4.0 nbformat 5.10.4 networkx 3.1 norns NA numba 0.58.1 numpy_groupies 0.9.22 overrides NA packaging 24.1 parso 0.8.4 patsy 0.5.6 pexpect 4.9.0 pickleshare 0.7.5 pkg_resources NA platformdirs 4.2.2 prometheus_client NA prompt_toolkit 3.0.47 psutil 6.0.0 ptyprocess 0.7.0 pure_eval 0.2.3 pyarrow 17.0.0 pybedtools 0.10.0 pycparser 2.22 pydev_ipython NA pydevconsole NA pydevd 2.9.5 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pyfaidx 0.8.1.2 pygments 2.18.0 pyparsing 3.1.2 pysam 0.22.1 pythonjsonlogger NA pytz 2024.1 referencing NA requests 2.32.3 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rich NA rpds NA scipy 1.10.1 send2trash NA six 1.16.0 sklearn 1.3.2 sniffio 1.3.1 stack_data 0.6.3 statsmodels 0.14.1 texttable 1.7.0 threadpoolctl 3.5.0 tornado 6.4.1 tqdm 4.66.5 traitlets 5.14.3 typing_extensions NA uri_template NA urllib3 2.2.2 velocyto 0.17.17 vscode NA wcwidth 0.2.13 webcolors 24.6.0 websocket 1.8.0 xdg NA xxhash NA yaml 6.0.2 zipp NA zmq 26.1.0
----- IPython 8.12.3 jupyter_client 8.6.2 jupyter_core 5.7.2 jupyterlab 4.2.4 notebook 7.2.1 ----- Python 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0] Linux-6.8.0-39-generic-x86_64-with-glibc2.17 ----- Session information updated at 2024-08-08 15:44